Rejoinder: Boosting Algorithms: 
Regularization, Prediction and Model 
Fitting 

Peter Bühlmann and Torsten Hothorn 

1. DEGREES OF FREEDOM FOR BOOSTING 



We are grateful that Hastie points out the connec- 
tion to degrees of freedom for LARS which leads to 
another — and often better — definition of degrees of 
freedom for boosting in generalized linear models. 

As Hastie writes and as we said in the paper, our 
formula for degrees of freedom is only an approxima- 
tion: the cost of searching, for example, for the best 
variable in componentwise linear least squares or 
componentwise smoothing splines, is ignored. Hence, 
our approximation formula 

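$$\mathrm{df}(m) = \mathrm{trace}(\mathcal{B}_m)$$

for the degrees of freedom of boosting in the mth 
iteration underestimates the true degrees of freedom. 
The latter is defined (for regression with L2-loss) as 

$$\mathrm{df}_{\mathrm{true}}(m) = \sum_{i=1}^{n} \mathrm{Cov}(\hat{Y}_i, Y_i)/\sigma_\varepsilon^2, \qquad \hat{Y} = \mathcal{B}_m Y,$$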

cf. Efron et al. [5]. 
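
In a simulation, where the true mean and error variance are known, the covariance definition can be approximated by Monte Carlo. The following is a minimal sketch under those assumptions; the helper name `covariance_df` and the OLS sanity check are illustrative and not part of the paper.

```python
import numpy as np

def covariance_df(fit, mu, sigma=1.0, n_rep=2000, seed=0):
    """Monte Carlo estimate of sum_i Cov(Yhat_i, Y_i) / sigma^2.

    fit   : function mapping a response vector y (length n) to fitted values
    mu    : true mean vector E[Y] (length n), known only in simulations
    sigma : error standard deviation (Gaussian errors are assumed here)
    """
    rng = np.random.default_rng(seed)
    n = len(mu)
    Y = mu + sigma * rng.standard_normal((n_rep, n))   # one response vector per row
    Yhat = np.array([fit(y) for y in Y])
    # Sample covariance between Yhat_i and Y_i, summed over i = 1, ..., n
    cov_sum = ((Yhat - Yhat.mean(axis=0)) * (Y - Y.mean(axis=0))).sum() / (n_rep - 1)
    return cov_sum / sigma ** 2

# Sanity check on a case with a known answer: for ordinary least squares with
# p linearly independent columns, the covariance definition gives exactly p.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
H = X @ np.linalg.solve(X.T @ X, X.T)        # OLS hat matrix
df_ols = covariance_df(lambda y: H @ y, mu=X @ np.ones(3))
print(round(df_ols, 2))                       # close to 3, up to Monte Carlo error
```

Replacing the OLS fit by a boosting fit stopped at iteration m gives a simulation-based approximation to the true degrees of freedom, which accounts for the cost of searching because the selection is redone for every simulated response.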

For fitting linear models, Hastie illustrates nicely 
that for infinitesimal forward stagewise (iFSLR) and 
the Lasso, the cost of searching can be easily 
accounted for in the framework of the LARS 
algorithm. With k steps in the algorithm, its degrees 
of freedom are given by 

$$\mathrm{df}_{\mathrm{LARS}}(k) = k.$$

For quite a few examples, this coincides with the 
number of active variables (variables which have been 
selected) when using k steps in LARS, that is, 

$$\mathrm{df}_{\mathrm{LARS}}(k) \approx \mathrm{df}_{\mathrm{actset}}(k) = \text{cardinality of active set}.$$

Note that the number of steps in $\mathrm{df}_{\mathrm{LARS}}$ is not 
meaningful for boosting with componentwise linear 
least squares while $\mathrm{df}_{\mathrm{actset}}(m)$ for boosting with m 
iterations can be used (and often seems reasonable; 
see below). We point out that df(m) and $\mathrm{df}_{\mathrm{actset}}(m)$ 
are random (and hence they cannot be degrees of 
freedom in the classical sense). We will discuss in 
the following whether they are good estimators for 
the true (nonrandom) $\mathrm{df}_{\mathrm{true}}(m)$. 
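
Both random quantities can be tracked along the boosting path. The sketch below is a plain NumPy illustration of boosting with componentwise linear least squares, not the mboost implementation; the function name `l2boost_df`, the centering of the covariates and the convention of counting the intercept as one degree of freedom are assumptions made for this example.

```python
import numpy as np

def l2boost_df(X, y, n_iter=100, nu=0.1):
    """Componentwise linear least squares boosting with squared error loss.

    Returns trace(B_m) and the active-set size (plus intercept) for m = 1..n_iter.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)                 # centered covariates
    norms2 = (Xc ** 2).sum(axis=0)
    resid = y - y.mean()                    # start from the intercept-only fit
    # R maps y to the current residuals, so the boosting operator is B_m = I - R.
    R = np.eye(n) - np.ones((n, n)) / n
    active = set()
    df_trace, df_actset = [], []
    for _ in range(n_iter):
        coefs = Xc.T @ resid / norms2             # least squares slope per covariate
        j = int(np.argmax(coefs ** 2 * norms2))   # largest reduction in RSS
        xj = Xc[:, j]
        resid = resid - nu * coefs[j] * xj
        R = R - nu * np.outer(xj, xj @ R) / norms2[j]   # R <- (I - nu * H_j) R
        active.add(j)
        df_trace.append(n - np.trace(R))    # trace(B_m)
        df_actset.append(len(active) + 1)   # selected covariates plus intercept
    return np.array(df_trace), np.array(df_actset)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 4] + rng.standard_normal(100)   # toy data, one effective covariate
df_trace, df_actset = l2boost_df(X, y, n_iter=50, nu=0.1)
```

After the first iteration the trace formula gives 1 + ν while the active-set count already equals 2, which mirrors the discrepancy for small m discussed in the text.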

When using a base procedure other than 
componentwise linear least squares, for example, 
componentwise smoothing splines, the notion of $\mathrm{df}_{\mathrm{actset}}(m)$ 
is inappropriate (the number of selected covariates 
times the degrees of freedom of the base procedure 
is not appropriate for assigning degrees of freedom). 

We now present some simulated examples where 
we can evaluate the true $\mathrm{df}_{\mathrm{true}}$ for L2Boosting. The 
first two are with componentwise linear least squares 
for fitting a linear model and the third is with 
componentwise smoothing splines for fitting an additive 
model. The models are 

$$Y_i = \sum_{j=1}^{p} \beta_j x_i^{(j)} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1)\ \text{i.i.d.}, \qquad i = 1, \ldots, n,$$

with fixed design from $\mathcal{N}_p(0, \Sigma)$, $\Sigma_{ij} = 0.5^{|i-j|}$, $p_{\mathrm{eff}}$ 
nonzero regression coefficients $\beta_j$ and with parameters 

$p = 10$, $p_{\mathrm{eff}} = 1$, 

$\beta = (1, 1, 1, 1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0, \ldots)$. 

Fig. 1. Model (1) and boosting with componentwise linear least squares (ν = 0.1). True degrees of freedom $\mathrm{df}_{\mathrm{true}}(m)$ (dashed 
black line) and df(m) (shaded gray lines, left panel) and $\mathrm{df}_{\mathrm{actset}}(m)$ (shaded gray lines, right panel) from 100 simulations. 



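All models (1)-(3) have the same signal-to-noise 
ratio. In addition, we consider the Friedman #1 
additive model with p = 20 and $p_{\mathrm{eff}} = 5$: 

$$Y_i = 10\sin(\pi x_i^{(1)} x_i^{(2)}) + 20(x_i^{(3)} - 0.5)^2 + 10 x_i^{(4)} + 5 x_i^{(5)} + \varepsilon_i, \qquad i = 1, \ldots, n,$$

with fixed design from $U[0, 1]^{20}$ and i.i.d. errors 
$\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, $i = 1, \ldots, n$, where 

(4) $\sigma_\varepsilon^2 = 1$, 
(5) $\sigma_\varepsilon^2 = 10$. 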
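
For readers who want to reproduce this type of setting, data from models (4) and (5) can be simulated as follows. This is a minimal sketch assuming the standard form of the Friedman #1 regression function; the function names and the sample size n = 100 are assumptions made for illustration only.

```python
import numpy as np

def friedman1_mean(X):
    """Friedman #1 regression function; only the first five columns matter."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])

def simulate_friedman1(X, sigma2, rng):
    """Draw one response vector for a fixed design X and error variance sigma2."""
    return friedman1_mean(X) + np.sqrt(sigma2) * rng.standard_normal(X.shape[0])

rng = np.random.default_rng(0)
n, p = 100, 20                        # n = 100 is an assumed sample size
X = rng.uniform(size=(n, p))          # design drawn once from U[0,1]^20, then held fixed
y4 = simulate_friedman1(X, sigma2=1.0, rng=rng)    # error variance of model (4)
y5 = simulate_friedman1(X, sigma2=10.0, rng=rng)   # error variance of model (5)
```

Keeping `X` fixed and redrawing only the errors corresponds to the fixed-design setting in which the true degrees of freedom are defined.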



Figures 1-4 display the results. As already 
mentioned, our approximation df(m) underestimates the 
true degrees of freedom. Hence, our penalty term in 
AIC or similar information criteria tends to be too 
small. Furthermore, our df(m) is less variable than 
$\mathrm{df}_{\mathrm{actset}}(m)$. Looking in detail at the sparse 
cases from models (1) and (2) in Figures 1 and 2, 
respectively, we see that (i) our df(m) is accurate for the range of 
iterations which are reasonable (note that we should 
not spend more degrees of freedom than, say, 2-3 if 
$p_{\mathrm{eff}} = 1$; OLS on the single effective variable, 
including an intercept, would have $\mathrm{df}_{\mathrm{true}} = 2$); and (ii) the 
active set degrees of freedom are too large for the 
first few values of m, that is, $\mathrm{df}_{\mathrm{actset}}(m) = 2$ (one 
variable and the intercept) although $\mathrm{df}_{\mathrm{true}}(m) < 1.5$ 
for m < 5. Such behavior disappears in the less 
sparse case in model (3), which is an example where 
df(m) underestimates very heavily; see Figure 3. 

Fig. 2. Model (2) and boosting with componentwise linear least squares (ν = 0.1). Other specifications as in Figure 1. 

Fig. 3. Model (3) and boosting with componentwise linear least squares (ν = 0.1). Other specifications as in Figure 1. 

Despite some (obvious) drawbacks of $\mathrm{df}_{\mathrm{actset}}(m)$, 
it works reasonably well. Hastie has asked us to give 
a correction formula for our df(m). His discussion 
summarizing the nice relation between LARS, iFSLR 
and L2Boosting, together with our simulated 
examples, suggests that $\mathrm{df}_{\mathrm{actset}}(m)$ is a better 
approximation for degrees of freedom for boosting with 
the componentwise linear base procedure. We have 
implemented $\mathrm{df}_{\mathrm{actset}}(m)$ in version 1.0-0 of the mboost 
package [9] for assigning degrees of freedom of 
boosting with componentwise linear least squares for 
generalized linear models. Unfortunately, in contrast to 
LARS, $\mathrm{df}_{\mathrm{actset}}(m)$ will never be exact. It seems that 
assigning correct degrees of freedom for boosting is 
more difficult than for LARS. For other learners, 
for example, the componentwise smoothing spline, 
we do not even have a better approximation for 
degrees of freedom. Our formula df(m) worked 
reasonably well for the models in (4) and (5); changing the 
signal-to-noise ratio by a factor 10 gave almost 
identical results (which is unclear a priori because df(m) 
depends on the data). But this is no guarantee for 
generalizing to other settings. In the absence of a better 
approximation formula in general, we still think that 
our df(m) formula is useful as a rough 
approximation for degrees of freedom of boosting with 
componentwise smoothing splines. And we agree with 
Hastie that cross-validation is a valuable 
alternative for the task of estimating the stopping point of 
boosting iterations. 

Fig. 4. Left: model (4). Right: model (5). Boosting with componentwise smoothing splines with four degrees of freedom 
(ν = 0.1). True degrees of freedom $\mathrm{df}_{\mathrm{true}}(m)$ (dashed black line) and df(m) (shaded gray lines, for both panels). 

2. HISTORICAL REMARKS AND NUMERICAL 
OPTIMIZATION 

Buja, Mease and Wyner (BMW hereafter) make 
a very nice and detailed contribution regarding the 
history and development of boosting. 

BMW also ask why we advocate Friedman's gra- 
dient descent as the boosting standard. First, we 
would like to point out that computational efficiency 
in boosting does not necessarily yield better statisti- 
cal performance. For example, a small step-size may 
be beneficial in comparison to step-size ν = 1, say. 
Related to this fact, the quadratic approximation of 
the loss function as described by BMW may not be 
better than the linear approximation. To exemplify, 
take the negative log-likelihood loss function in (3.1) 
for binary classification. When using the linear ap- 
proximation, the working response (i.e., the negative 
gradient) is 

$$z_{i,\mathrm{linapp}} = 2(y_i - p(x_i)), \qquad y_i \in \{0, 1\}.$$

In contrast, when using the quadratic approxima- 
tion, we end up with LogitBoost as proposed by 
Friedman, Hastie and Tibshirani [7]. The working 
response is then 

$$z_{i,\mathrm{quadapp}} = \frac{1}{2}\,\frac{y_i - p(x_i)}{p(x_i)(1 - p(x_i))}.$$

The factor 1/2 appears in [7] when doing the 
linear update but not for the working response. We 
see that $z_{i,\mathrm{quadapp}}$ is numerically problematic 
whenever $p(x_i)$ is close to 0 or 1, and [7], on pages 
352-353, address this issue by thresholding the value of 
$z_{i,\mathrm{quadapp}}$ to an "ad hoc" upper limit. On the other 
hand, with the linear approximation and $z_{i,\mathrm{linapp}}$, 
such numerical problems do not arise. This is a 
reason why we generally prefer to work with the linear 
approximation and Friedman's gradient descent 
algorithm [6]. 
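
The numerical contrast between the two working responses is easy to see on a few values. The sketch below simply evaluates the linear and quadratic working responses for y = 1 as the fitted probability moves toward 0; the function names are illustrative.

```python
import numpy as np

def z_linapp(y, p):
    """Working response from the linear approximation: 2 * (y - p)."""
    return 2.0 * (y - p)

def z_quadapp(y, p):
    """Working response from the quadratic approximation: 0.5 * (y - p) / (p * (1 - p))."""
    return 0.5 * (y - p) / (p * (1.0 - p))

y = 1.0
for p in [0.5, 0.1, 1e-3, 1e-6]:
    # The linear version stays within [-2, 2]; the quadratic version grows like 1 / (2p).
    print(f"p={p:<8g} linapp={z_linapp(y, p):8.4f} quadapp={z_quadapp(y, p):14.4f}")
```

The linear working response is bounded by 2 in absolute value for any fitted probability, whereas the quadratic one is unbounded as p(x_i) approaches 0 or 1, which is why a threshold is needed in that case.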



BMW also point out that there is no "random el- 
ement" in boosting. In our experience, aggregation 
in the style of bagging is often very useful. A combi- 
nation of boosting with bagging has been proposed 
in Bühlmann and Yu [2] and similar ideas appear 
in Friedman [8] and Dettling [4]. In fact, random 
forests [1] also involve some bootstrap sampling in 
addition to the random sampling of covariates in the 
nodes of the trees; without the bootstrap sampling, 
it would not work as well. We agree with BMW that 
quite a few methods actually benefit from additional 
bootstrap aggregation. Our paper, however, focuses 
solely on boosting as a "basic module" without (or 
before) random sampling and aggregation. 

3. LIMITATIONS OF THE "STATISTICAL 
VIEW" OF BOOSTING 

BMW point out some limitations of the "statisti- 
cal view" (i.e., the gradient descent formulation) of 
boosting. We agree only in part with some of their 
arguments. 

3.1 Conditional Class Probability Estimation 

BMW point out that conditional class probabili- 
ties cannot be estimated well by either AdaBoost or 
LogitBoost, and later in their discussion they men- 
tion that overfitting is a severe problem. Indeed, 
the amount of regularization for conditional class 
probability estimation should be (markedly) differ- 
ent than for classification. For probability estima- 
tion we typically use (many) fewer iterations, that 
is, a less complex fit, than for classification. This 
fits into the picture of the rejoinder in [7] and [2], 
saying that the 0-1 misclassification loss in (3.2) is 
much more insensitive to overfitting. For accurate 
conditional class probability estimation, we should 
use the surrogate loss, for example, the negative log- 
likelihood loss in (3.1), for estimating (e.g., via cross- 
validation) a good stopping iteration. Then, condi- 
tional class probability estimates are often quite rea- 
sonable (or even very accurate), depending of course 
on the base procedure, the structure of the underly- 
ing problem and the signal-to-noise ratio. We agree 
with BMW that AdaBoost or LogitBoost overfit for 
conditional class probability estimation when using 
the wrong strategy — namely, tuning the boosting al- 
gorithm according to optimal classification. Thus, 
unfortunately, the goals of accurate conditional class 
probability estimation and good classification are in 
conflict with each other. This is a general fact (see 
rejoinder by Friedman, Hastie and Tibshirani [7]) 
but it seems to be especially pronounced with boost- 
ing complex data. Having said that, we agree with 
BMW that AIC/BIC regularization with the neg- 
ative log-likelihood loss in (3.1) for binary classifi- 
cation will be geared toward estimating conditional 
probabilities, and for classification, we should use 
more iterations (less regularization). 
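
The different behavior of the two criteria can be illustrated on a toy boosting path. The sketch below assumes the negative log-likelihood is used in the scaled form log2(1 + exp(-2 ỹ f)) with ỹ = 2y - 1 and f on the half log-odds scale; the function name `stopping_by_loss` and the artificial path are hypothetical.

```python
import numpy as np

def stopping_by_loss(F_path, y):
    """Select a stopping iteration by surrogate loss and by 0-1 loss.

    F_path[m, i] : held-out score f(x_i) after iteration m + 1 (half log-odds scale)
    y[i]         : held-out labels in {0, 1}
    """
    y_tilde = 2 * y - 1
    margins = y_tilde * F_path
    # Negative log-likelihood, log2(1 + exp(-2 * margin)), averaged over held-out points
    nll = np.logaddexp(0.0, -2.0 * margins).mean(axis=1) / np.log(2.0)
    err = (margins <= 0).mean(axis=1)           # misclassification rate
    return int(np.argmin(nll)), int(np.argmin(err)), nll, err

# Toy path: the scores are only rescaled from one iteration to the next, so the
# classifier never changes while the implied probabilities become more extreme.
y = np.array([1, 1, 1, 1])
direction = np.array([1.0, 1.0, 1.0, -1.0])     # one held-out point is misclassified
F_path = np.outer(0.1 * np.arange(1, 21), direction)
m_nll, m_01, nll, err = stopping_by_loss(F_path, y)
```

On this path the misclassification rate is the same at every iteration, so it gives no signal for stopping, while the surrogate loss has an interior minimum and increases afterwards.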

3.2 Robustness 

For classification, BMW argue that robustness in 
the response space is not an issue since, "binary re- 
sponses have no problem of vertically outlying val- 
ues." We disagree with the relevance of their argu- 
ment. For logistic regression, robustification of the 
MLE has been studied in detail. Even though the 
MLE has bounded influence, the bound may be too 
large and for practical problems this may matter a 
lot. Kiinsch, Stefanski and Carroll [10] is a good ref- 
erence which also cites earlier papers in this area. 
Note that with the exponential loss, the issue of too 
large influence is even more pronounced than with 
the log-likelihood loss corresponding to the MLE. 
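
As a rough numerical illustration, one can compare the size of the derivative of each loss with respect to the fitted function, which determines how strongly a single observation pulls on a gradient-based fit. This is only a sketch: it assumes the losses exp(-ỹf) and log(1 + exp(-2ỹf)) with margin ỹf, and derivative magnitudes are a proxy for, not a formal definition of, influence.

```python
import numpy as np

def grad_exponential(margin):
    """Absolute derivative of exp(-margin) with respect to f (margin = y_tilde * f)."""
    return np.exp(-margin)

def grad_loglik(margin):
    """Absolute derivative of log(1 + exp(-2 * margin)) with respect to f; never exceeds 2."""
    return 2.0 / (1.0 + np.exp(2.0 * margin))

margins = np.array([2.0, 0.0, -2.0, -5.0, -10.0])
for m, ge, gl in zip(margins, grad_exponential(margins), grad_loglik(margins)):
    print(f"margin={m:6.1f}  exponential={ge:12.3f}  log-likelihood={gl:6.3f}")
```

For badly misclassified points (large negative margins) the derivative of the exponential loss grows without bound, while that of the log-likelihood loss levels off at 2.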

4. EXEMPLIFIED LIMITATIONS OF THE 
"STATISTICAL VIEW" 

The paper by Mease and Wyner [11] presents some 
"contrary evidence" to the "statistical view" of 
boosting. We repeat some of the points made by Bühlmann 
and Yu [3] in the discussion of Mease and Wyner's 
paper. 



4.1 Stumps Should be Used for Additive Bayes 
Decision Rules 

The sentence in the subtitle which is put forward, 
discussed and criticized by BMW never appears in 
our paper. The main source of confusion seems to be 
the concept of "additivity" of a function. It should 
be considered on the logit-scale (for AdaBoost, 
LogitBoost or BinomialBoosting), since the population 
minimizer of AdaBoost, LogitBoost or 
BinomialBoosting is half of the log-odds ratio. Mease and Wyner 
[11] created a simulation model which is additive 
as a decision function but nonadditive on the logit- 
scale for the conditional class probabilities; and they 
showed that larger trees are then better than stumps 
(which is actually consistent with what we write in 
our paper). We think that this is the main reason 
why Mease and Wyner [11] found "contrary evi- 
dence." 

We illustrate in Figure 5 that our heuristic of 
preferring stumps over larger trees is useful if the 
underlying model is additive for the logit of the 
conditional class probabilities. The simulation model here 
is the same as in [3] which we used to address the 
"contrary evidence" findings in [11]; our model is in- 
spired by Mease and Wyner [11] but we make the 
conditional class probabilities additive on the logit- 
scale: 

(6) $$\mathrm{logit}(p(X)) = 8\sum_{j=1}^{5}(X^{(j)} - 0.5), \qquad Y \sim \mathrm{Bernoulli}(p(X)),$$

and $X \sim U[0, 1]^{20}$ (i.e., i.i.d. U[0, 1]). This model has 
Bayes error rate approximately equal to 0.1 (as in 
[11]). We use n = 100, p = 20 (i.e., 15 ineffective 
predictors), and we generate test sets of size 2000. We 
consider BinomialBoosting with stumps and with 
larger trees whose varying size is about 6-8 
terminal nodes. We consider the misclassification test 
error, the test-set surrogate loss with the negative 
log-likelihood and the absolute error for probabilities 

$$\frac{1}{2000}\sum_{i=1}^{2000} |\hat{p}(X_i) - p(X_i)|,$$

where averaging is over the test set. Figure 5 
displays the results (the differences between stumps 
and larger trees are significant) which are in line 
with the explanations and heuristics in our paper 
but very different from what BMW describe. To 
reiterate, we think that the reason for the "contrary 
evidence" in Mease and Wyner [11] comes from the 
fact that their model is not additive on the 
logit-scale. We also see from Figure 5 that early stopping 
is important for probability estimation, in particular 
when measuring in terms of test-set surrogate loss; a 
bit surprisingly, BinomialBoosting with stumps does 
not overfit within the first 1000 iterations in terms 
of absolute errors for conditional class probabilities 
(this is probably due to the low Bayes error rate of 
the model; eventually, we will see overfitting here as 
well). Finally, Bühlmann and Yu [3] also argue that 
the findings here also appear when using "discrete 
AdaBoost." 

Fig. 5. BinomialBoosting (ν = 0.1) with stumps (solid line) and larger trees (dashed line) for model (6). Left panel: test-set 
misclassification error; middle panel: test-set surrogate loss; right panel: test-set absolute error for probabilities. Averaged over 
50 simulations. 
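
Model (6) and the absolute error criterion are straightforward to simulate. The following sketch draws data from the model, approximates its Bayes error rate by Monte Carlo and defines the test-set absolute error for probabilities; the function names and the Monte Carlo sample size are assumptions for illustration, and no boosting fit is included.

```python
import numpy as np

def simulate_model6(n, rng, p=20, p_eff=5):
    """Draw (X, Y) from model (6) and return the true probabilities as well."""
    X = rng.uniform(size=(n, p))
    logit = 8.0 * (X[:, :p_eff] - 0.5).sum(axis=1)   # additive on the logit scale
    prob = 1.0 / (1.0 + np.exp(-logit))
    Y = rng.binomial(1, prob)
    return X, Y, prob

def mean_abs_prob_error(p_hat, p_true):
    """Average of |p_hat(X_i) - p(X_i)| over a test set."""
    return np.mean(np.abs(p_hat - p_true))

rng = np.random.default_rng(0)
X, Y, prob = simulate_model6(200_000, rng)
# The Bayes rule predicts the more likely class, so its error is E[min(p, 1 - p)].
bayes_error = np.minimum(prob, 1.0 - prob).mean()
print(round(bayes_error, 3))     # roughly 0.1
```

Given fitted probabilities `p_hat` from any classifier on a test set drawn this way, `mean_abs_prob_error(p_hat, prob)` evaluates the third criterion shown in Figure 5.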

In our opinion, it is exactly the "statistical view" 
which helps to explain the phenomenon in Figure 5. 
The "parameterization" with stumps is only "effi- 
cient" if the model for the logit of the conditional 
class probabilities is additive; if it is nonadditive 
on the logit-scale, it can easily happen that larger 
trees are better base procedures, as found indeed by 
Mease and Wyner [11]. 

4.2 Early Stopping Should be Used to Prevent 
Overfitting 

BMW indicate that early stopping is often not 
necessary — or even degrades performance. One 
should be aware that they consider the special case 
of binary classification with "discrete AdaBoost" and 
use trees as the base procedure. Arguably, this is the 
original proposal and application of boosting. 

In our exposition, though, we focus not only on 
binary classification but also on many other tasks, such as 
estimating class conditional probabilities, regression 
functions and survival functions. As BMW write, 



when using the surrogate loss for evaluating the per- 
formance of boosting, overfitting kicks in quite early 
and early stopping is often absolutely crucial. It is 
dangerous to present a message that early stopping 
might degrade performance: the examples in Mease 
and Wyner [11] provide marginal improvements of 
about 1-2% without early stopping (of course, they 
also stop somewhere) while the loss of not stopping 
early can be huge in applications other than classi- 
fication. 

4.3 Shrinkage Should be Used to Prevent 
Overfitting 

We agree with BMW that shrinkage does not al- 
ways improve performance. We never stated that 
shrinkage would prevent overfitting. In fact, in lin- 
ear models, infinitesimal shrinkage corresponds to 
the Lasso (see Section 5.2.1) and clearly, the Lasso 
is not free of overfitting. In our view, shrinkage adds 
another dimension of regularization. If we do not 
want to tune the amount of shrinkage, the value 
ν = 0.1 is often a surprisingly good default value. 
Of course, there are examples where such a default 
value is not optimal. 

4.4 The Role of the Surrogate Loss Function 
and Conclusions From BMW 

BMW's comments on the role of the surrogate loss 
function when using a particular algorithm are in- 
triguing. Their algorithm can be viewed as an en- 
semble method; whether we should call it a boosting 
algorithm is debatable. And for sure, their method 
is not within the framework of functional gradient 
descent algorithms. 

BMW point out that there are still some mys- 
teries about AdaBoost. In our view, the overfitting 
behavior is not well understood while the issue of 
using stumps versus larger tree base procedures has 
a coherent explanation as pointed out above. There 
are certainly examples where overfitting occurs with 
AdaBoost. The (theoretical) question is whether there 
is a relevant class of examples where AdaBoost is 
not overfitting when running infinitely many itera- 
tions. We cannot answer the question with numer- 
ical examples since "infinitely many" can never be 
observed on a computer. The question has to be 
answered by rigorous mathematical arguments. For 
practical purposes, we advocate early stopping as a 
good and important recipe. 